DOCUMENT RESUME

ED 237 514                                        TM 820 818

AUTHOR        Wainer, Howard
TITLE         Testing and Test Theory: Whither and Whence.
INSTITUTION   Educational Testing Service, Princeton, N.J.
REPORT NO     ETS-RM-82-1
PUB DATE      Jan 82
NOTE          27p.; Presented at the National Relations Office of
              the Educational Testing Service (Washington, DC,
              January 9, 1981).
PUB TYPE      Viewpoints (120) -- Speeches/Conference Papers (150)
EDRS PRICE    MF01/PC02 Plus Postage.
DESCRIPTORS   Difficulty Level; *Latent Trait Theory; Metaphors;
              *Testing; Test Items; Test Reliability; *Test Theory

ABSTRACT
This paper is the transcript of a talk given to those who use test
information but who have little technical background in test theory.
The concepts of modern test theory are compared with traditional test
theory, as well as a probable future test theory. The explanations
given are couched within an extended metaphor that allows a full
description of the concepts and implications of test theory without
utilizing any mathematics. (Author)



Reproductions supplied by EDRS are the best that can be made
from the original document.
RM-82-1



TESTING AND TEST THEORY: WHITHER AND WHENCE 



Howard Wainer 



PERMISSION TO REPRODUCE THIS 
MATERIAL HAS BEEN GRANTED BY 



TO THE EDUCATIONAL RESOURCES
INFORMATION CENTER (ERIC).

U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)
This document has been reproduced as
received from the person or organization
originating it.
Minor changes have been made to improve
reproduction quality.

Points of view or opinions stated in this document
do not necessarily represent official NIE
position or policy.






A talk given at the National Relations Office, Educational Testing Service, Washington, D.C.



Testing and Test Theory: Whither and Whence¹

Howard Wainer

ETS Research Memorandum
RM-82-1

Educational Testing Service
Princeton, New Jersey 08541

January, 1982

¹ A talk given at the National Relations Office of the Educational
Testing Service in Washington, D.C., January 9, 1981. My thanks
to Thomas E. Donlon and Norman Frederiksen for their helpful
comments on an earlier draft.

Copyright (c) 1982 by Educational Testing Service. All rights reserved.



Abstract

This paper is the transcript of a talk given to those who use test
information but who have little technical background in test theory.
The concepts of modern test theory are compared with traditional
test theory, as well as a probable future test theory. The explanations
given are couched within an extended metaphor that allows a full
description of the concepts and implications of test theory without
utilizing any mathematics.



Table of Contents

Introduction
Some Aims of Testing and How They Can be Accomplished
Ability
Accuracy of Measurement
Item Difficulty
Aside
Some Examples
Ending
An Idiosyncratic Reading List
References


while you and i have lips and voices which are
for kissing and to sing with who cares if some
one-eyed son-of-a-bitch invents an instrument
to measure spring with?

(e. e. cummings, 1926)

INTRODUCTION

On the fifth floor of this building there are two groups of
researchers. The first group is in complete sympathy with the
notions expressed by Cummings; the second also enjoy kissing and
singing but feel that in many circumstances it is important to
measure certain aspects of Spring. They are currently engaged in
identical research programs, namely the study of hurdle jumping
ability in humans. The strategies these two groups use are quite
different, and it is instructive to consider them. The first group
had panels of expert coaches studying the movements and builds of
various subjects. They had them running and jumping, and they kept
copious notes. Later the notes were compared and attempts were made
to arrive at a consensus about each subject. The second group had a
long runway constructed with a sequence of hurdles spaced out along it;
the hurdles started out very low and gradually got higher and higher.
Each subject was instructed to run along the runway and jump over each
hurdle as it came. I noticed that most would get over the lowest hurdles
easily, and would knock over the highest ones. They had a single
clerk recording the pattern of hurdles (standing or knocked over)
after each runner completed the course. The runner's hurdling
ability was somehow tied to the height of the hurdles that were
successfully negotiated.

After observing this for a while I spoke with the directors of these
two projects regarding their aims and the probable outcomes. It turned
out that this was one part of a much larger enterprise. Similar
sites are to be set up all over the world, and careful records are to be
kept of hurdling ability so as to keep track of both individual improvement
and any changes that might occur in the general hurdling ability of people
over time. The director of the first project told me that they were
training experts to make judgements about hurdling ability based upon
in-depth study of each runner. I asked how they would measure change
in hurdling ability over time, both within the same individual and over all
children of the same age. He replied that a good coach could make such
judgements accurately. As I turned to go to speak to the other study
director, I overheard two of the coaches arguing heatedly about whether
Joe Louis could have beat Mohammed Ali when both were at their peaks.

The second study was quite different. They had a mimeograph machine
working at top speed turning out an instruction sheet that specified the
number of hurdles, the distance between them, the height of each hurdle,
their order, the kind of track they must be on, allowable bounds of
temperature, wind, etc., etc. They were also preparing a package of
hurdles to be mailed out to the other testing sites. At the coffee
machine I overheard a discussion between two of the researchers
regarding how amazing the progress of women swimmers had been over the
last ten years, so much so that the top women of today would have
dominated the men's Olympic team just twenty years earlier.

My visit to the fifth floor reassured me of the importance of the
work that I would like to discuss with you today.

SOME AIMS OF TESTING AND HOW THEY CAN BE ACCOMPLISHED

The hurdling test described above is a very good one. It embodies
much about what we think is sensible about the measurement of human
ability. It also presents a context for our discussion of test theories
past, present and future.

Often what we are interested in when we give a test is a measure of the
ability of the examinee. In the hurdling test just described the measure of
ability relates to the heights of the hurdles successfully cleared. This
can be operationalized in a variety of ways. If someone performs exactly
according to expectation they will have a response pattern like

    11111100000

in which a '1' represents successfully clearing a hurdle, and a '0' means
knocking it over. If the hurdles are in order of increasing size we could
then characterize a person by the height between the highest hurdle cleared
and the lowest one missed, i.e.

    Heights (cm)   10  20  30  40  50  60  70  80  90  100  110
    Performance     1   1   1   1   1   0   0   0   0    0    0

This would imply that we would estimate this person's ability as about 55 cm.
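
The scoring rule just described can be sketched in a few lines. This is only an illustration: the function name and the choice of the exact midpoint are assumptions made here, not something the talk specifies.

```python
def estimate_ability(heights_cm, pattern):
    """Midpoint between the highest hurdle cleared and the lowest one missed.

    heights_cm: hurdle heights in increasing order.
    pattern:    string of '1' (cleared) and '0' (knocked over), one per hurdle.
    The midpoint convention is an assumption made for this sketch.
    """
    cleared = [h for h, r in zip(heights_cm, pattern) if r == "1"]
    missed = [h for h, r in zip(heights_cm, pattern) if r == "0"]
    if not cleared or not missed:
        raise ValueError("need at least one cleared and one missed hurdle")
    return (max(cleared) + min(missed)) / 2


heights = list(range(10, 120, 10))  # 10, 20, ..., 110 cm, as in the table
print(estimate_ability(heights, "11111000000"))  # 55.0
```

The pattern passed in is the one from the table above: five hurdles cleared, six knocked over.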

The second question we would ask is the accuracy of this estimate. In this
situation we might say that we were accurate to within 5 cm (i.e. between 50
and 60 cm). Of course, this presupposes tremendous consistency on the part
of the hurdler, but if we found that each time the hurdler took the test
the same result occurred we would arrive at the same conclusion, and our
confidence in this conclusion would increase. Of course, if we wanted more
accuracy in our estimate we would have to insert more hurdles between 50
and 60 cm for this person. This is a common problem in testing, for the
increased accuracy obtained with a more finely graduated test carries an
increase in labor for the examinee (i.e. to measure to the nearest cm we
would need 110 hurdles rather than the 11 used here). If all hurdlers have
to attempt all hurdles, the increased labor is of little help for an
individual whose ability is far from the heights being used (i.e. having
successfully negotiated ten hurdles between 10 cm and 20 cm does not tell
us much more about our 50 cm hurdler than noting that he cleared the 10, 20,
and 30 cm hurdles).

Traditional practice increases the number of hurdles in the area in
which the greatest discrimination is required. Often we can get a feeling
for the structure of the test by drawing a graph that indicates what
the error of measurement is at each jumping level. Shown below is a
plot of the error for the test as it is now made up, and another for
one in which there were ten hurdles (items) inserted between hurdle 50
and 60 (yielding errors of .5 cm).

[Insert plot here of error as a function of ability for two tests]

Adaptive testing tries to solve this problem in another way which
will be detailed later.
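
The numbers behind such a plot can be sketched as follows. The sketch assumes that the error at a given jumping level is half the gap between the two hurdles that bracket it, and that the finer test has a hurdle at every centimeter from 50 to 60; both are assumptions made for illustration.

```python
def error_at(ability_cm, heights_cm):
    """Half the gap between the hurdles that bracket a given ability level."""
    below = [h for h in heights_cm if h <= ability_cm]
    above = [h for h in heights_cm if h > ability_cm]
    if not below or not above:
        return float("inf")  # ability lies outside the range the test covers
    return (min(above) - max(below)) / 2


coarse = list(range(10, 120, 10))                 # the test as it is now made up
fine = sorted(set(coarse) | set(range(50, 61)))   # extra hurdles from 50 to 60 cm

for level in (25, 55, 85):
    print(level, error_at(level, coarse), error_at(level, fine))
# 25 5.0 5.0
# 55 5.0 0.5
# 85 5.0 5.0
```

The extra hurdles buy accuracy only for jumpers whose ability lies between 50 and 60 cm; everywhere else the two tests are equally coarse.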

Suppose our jumper runs down the runway again, and this time clears
the 60 cm hurdle but knocks over the 50 cm. This tells us that his ability
is still likely to be in the 50-60 range, but our error range is expanded
a bit. This points out two components of error in ability estimation:

1) the accuracy limitations of the test construction, and
2) the variability of human performance.

We can control the first but not the second.

Enough concepts have now been illustrated so that we can compare these
aspects of testing as they were operationalized in traditional test theory
(e.g. Gulliksen, 1950), modern item response theory (e.g. Lord, 1980), and
in a Gedanken test theory of the future.

ABILITY

Traditional Test Theory operationalizes the examinee's ability
by the number right. The problem with this operationalization is that
although it works well when a person performs as he ought, it offers few
obvious clues when something is amiss. For example, a hurdler who gets
a response vector of 1111100000 would have a score of 5, as would someone
with a vector of 1111000001. The latter response seems clearly erroneous,
and one would suspect that there was either a clerical error in recording
or the runner had sidestepped the last hurdle (this corresponds to
successful guessing on a test). It would also give a score of 5 to someone
who scores 0000011111. With a response pattern like this we clearly
ought to suspect something peculiar is going on with the person taking
the test, and a sensible result ought to be to require a retesting. But
with traditional test theory a score of 5 is a score of 5, and that's it.
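
A minimal sketch of number-right scoring makes the point concrete; the function name is an illustrative choice, and the three patterns are the ones discussed above.

```python
def number_right(pattern):
    """Traditional score: count the hurdles cleared, ignoring which ones."""
    return pattern.count("1")


for pattern in ("1111100000", "1111000001", "0000011111"):
    print(pattern, number_right(pattern))
# 1111100000 5
# 1111000001 5
# 0000011111 5
```

All three response vectors receive the same score, although only the first looks like the work of a consistent 50 cm hurdler.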

A second shortcoming is that the ability scale derived is only ordinal.
That is, if one person scores 4 and another 5 we perceive that the
difference between them is the same as that between someone whose score is
9 and another 10. Clearly changes in raw score do not have the same meaning
all along the scale, even if the items are all evenly spaced. In addition,
suppose we had expanded the test so that we now had 10 items between 50 and
60 cm; would an increase in score from 14 (hurdles 10 cm - 59 cm) to a score
of 15 (jumping over all hurdles from 10 cm through 60 cm) mean the same as an
increase from 15 to 16 (clearing 70 cm)? Of course not!

Item Response Theory uses a nonlinear transformation of the
proportion correct as an estimate of ability, and centers this estimate on the
difficulties of the items. Further, it yields a goodness-of-fit test that
will clearly indicate when an unusual response pattern appears. Thus,
the ability given to someone who jumps over all hurdles up to and including
60 will be essentially the same regardless of how many hurdles were
intervening between the 50 and 60 cm hurdle. The only difference will be
the error estimate that is obtained. It also stretches out the ability scale
at the ends in such a way so as to yield ability estimates on an interval
scale (i.e. observed changes have the same meaning anywhere on the scale;
an increase of .5 has the same meaning whether it is from 1 to 1.5 or from
6 to 6.5). Most importantly, by keying ability estimates to the difficulty
of the items we obtain a test whose parameters do not change with the
norming sample. This is a major advance.
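
The talk does not say which nonlinear transformation is meant. One common choice in item response models is the log-odds (logit) of the proportion correct, and the sketch below uses it purely as an assumed example of how such a transformation stretches the scale at the ends.

```python
import math


def logit(p):
    """Log-odds of a proportion correct; an assumed example transformation."""
    if not 0.0 < p < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


# Equal-looking steps in proportion correct are unequal on the logit scale.
middle_step = logit(0.6) - logit(0.5)
end_step = logit(0.99) - logit(0.9)
print(round(middle_step, 3), round(end_step, 3))
```

A gain near the top of the raw scale corresponds to a much larger move on the transformed scale than a similar-looking gain in the middle, which is the stretching at the ends described above.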

Future Test Theory as I envision it will hold that ability estimates
have essentially the same structure as those of current IRT, except that they
will be directly referenced to material. In the case of the hurdling example,
this means that a person's ability is directly related to the height of the
hurdles jumped, and even if no one else ever took the test (how would one
compute percentiles?) the results are valid and of interest (i.e. a 70 cm
hurdler is a 70 cm hurdler no matter what else happens) and we can measure
progress in a metric that makes good sense.

ACCURACY OF MEASUREMENT

Traditional Test Theory - Traditionally, we would measure the accuracy
of the assessment of a person's ability estimate by looking at how the test
orders a group of people who have taken the test twice (or manufactured
two versions of the test artificially using, say, odd and even items). If
a test orders the people essentially the same way on two testings we say it
is 'reliable'. The extent to which it does this is its 'reliability'.
Let us look at the components of reliability. First, it depends upon the
discriminating ability of the items (if the test had only two heights of
hurdles, 20 cm and 20 feet, we would find that the test was not particularly
reliable, since virtually everyone would perform the same way on the test).
Second, it depends upon the inherent variability of the individuals being
tested (if on one administration a person cleared every hurdle and on the
next administration cleared none, we would be hard pressed to assign an
ability estimate). Third, it depends upon the variability of ability in
the sample being tested (if everyone had the same ability their ordering
from one administration to the next would vary enormously, thus indicating
an unreliable test when, in fact, the test could be quite good). This is
one of the gravest problems with traditional test theory: the reliance on
reliability. It can make the character of the test appear to change with
changes in the population being tested. Should the accuracy of a scale
change depending upon who was being weighed the day you stepped on it?

The effect of the norming group on test reliability and validity
is not merely a statistical curiosity. It occurs often, and can stir
up trouble when it is not understood. For example, it is common to find
among first year law students that the correlation between the students'
LSAT scores and their undergraduate grade point averages is zero, or even
negative. This is sometimes used as evidence against the validity of the
LSAT. The actual reason is that the LSAT was used (properly) as an
admission device along with UGPA. This makes the admitted group much more
homogeneous, and so reduces the correlation between the measures. One can
understand why it goes negative by thinking about the bivariate distribution.
Some students will do poorly on both measures and, therefore, not be
admitted. Others will do well on both; they go to Harvard. Thus, the
students who attend most law schools have done relatively better on one
than on the other, and so were admitted. Surely, a measure of a test's
efficacy should not depend upon who is being measured.
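
The selection effect can be reproduced with simulated applicants. The sketch is hypothetical throughout: scores are standardized random numbers, the two measures are built to correlate about .5 among all applicants, and a "typical school" is assumed to enroll those whose combined score falls in a middle band (those below are rejected, those above go elsewhere).

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)
applicants = []
for _ in range(20000):
    shared = rng.gauss(0.0, 1.0)         # what the two measures have in common
    lsat = shared + rng.gauss(0.0, 1.0)  # simulated, standardized scores
    ugpa = shared + rng.gauss(0.0, 1.0)
    applicants.append((lsat, ugpa))

# Assumed admission rule: combined score inside a middle band.
enrolled = [(l, u) for l, u in applicants if 0.5 < l + u < 1.5]

r_all = pearson([l for l, _ in applicants], [u for _, u in applicants])
r_enrolled = pearson([l for l, _ in enrolled], [u for _, u in enrolled])
print(round(r_all, 2), round(r_enrolled, 2))
```

Among all applicants the two measures are positively related; among the enrolled students, who all have roughly the same combined score, a high value on one measure implies a low value on the other, so the correlation turns negative.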

Item Response Theory - IRT deals with estimating the accuracy of a
test through the standard error of estimate. This is essentially a function
of the item structure; thus, if we have items spread every 10 cm at one part
of the test, the estimate for a person whose ability falls in that part will
be accurate to within 5 cm. If in another part of the test we have hurdles
every centimeter then the error at that part of the test is of the order of
.5 cm. This is regardless of the variability of ability of those people
taking the test. In fact, there need only be one person taking this test.
The problem with this is that it does not take into account the variability
of the person taking the test (actually it does, but not as an individual,
only on average). Thus, the error estimate is sometimes a bit on the
optimistic side.

Future Test Theory - This sort of test theory would expand or contract
the error estimate currently in use in IRT by the variability of the person
taking the test, therefore yielding a more honest estimate of variability,
but nonetheless one that is still independent of the group taking the
test. Thus, we ought to be able to give a better estimate for a person who
responds 1111100000 than one who responds 1111010000, or 1101001010.
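
One simple way to quantify how consistent a response vector is, offered only as an illustrative stand-in for the person variability described above, is to count the pairs of hurdles in which a lower one was knocked over but a higher one was cleared.

```python
def inversions(pattern):
    """Count (lower hurdle missed, higher hurdle cleared) pairs.

    Assumes the pattern is ordered from the lowest hurdle to the highest.
    A perfectly consistent hurdler has zero such pairs.
    """
    count = 0
    misses_so_far = 0
    for response in pattern:
        if response == "0":
            misses_so_far += 1
        else:
            count += misses_so_far  # this clear follows every earlier miss
    return count


for pattern in ("1111100000", "1111010000", "1101001010"):
    print(pattern, inversions(pattern))
# 1111100000 0
# 1111010000 1
# 1101001010 8
```

The three vectors all reflect five hurdles cleared, but the count of inconsistencies rises from 0 to 1 to 8, which is the sense in which the error estimate for the third hurdler ought to be wider than for the first.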

ITEM DIFFICULTY

Traditional Test Theory - Difficulty is not well defined in the
classic treatment (Gulliksen, 1950); in fact it is not listed in the index
as a term. A careful search turns up (p. 367 ff) a variety of definitions
that relate the difficulty of an item to the proportion of individuals that
get an item correct within a particular sample. This is the grist that will
eventually be made into a measure of difficulty, but at the time it was merely
one way of doing it. A problem with this formulation is that difficulty
changes from one norming group to another. A more serious problem is that
the concept of difficulty is not functionally tied to the concept of ability.

Item Response Theory - The most fundamental concept of IRT is the
functional relationship between the ability of a person and the difficulty of
the item. Referring back to the hurdling test, this means that we can describe
the likelihood of a person successfully negotiating a hurdle of a particular
height as a function of the difference between their jumping ability and the
height of the hurdle. The way that difficulty is defined is as a function of
the proportion of individuals who answer it correctly. A hurdle that is almost
never scaled is called 'difficult'. One that is almost never toppled is
considered 'easy'. These same designations are possible with traditional
test theory, but IRT defines 'difficulty' quantitatively, and unifies it
with ability within the context of a theoretical structure.
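
The talk gives no formula for this functional relationship. A logistic curve in the difference between ability and height is one widely used form, and the sketch below assumes it; the `scale_cm` parameter, which sets how quickly the chance of success changes, is a hypothetical value chosen for illustration.

```python
import math


def p_clear(ability_cm, height_cm, scale_cm=5.0):
    """Assumed logistic model: chance of clearing depends only on the
    difference between jumping ability and hurdle height."""
    return 1.0 / (1.0 + math.exp(-(ability_cm - height_cm) / scale_cm))


print(p_clear(55, 55))             # 0.5: hurdle exactly at the person's ability
print(round(p_clear(55, 45), 3))   # an easy hurdle, well below ability
print(round(p_clear(55, 65), 3))   # a difficult hurdle, well above ability
```

Because only the difference enters the function, a 55 cm hurdler facing a 45 cm hurdle has the same chance of success as a 65 cm hurdler facing a 55 cm hurdle.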

There is an apparent circularity here that needs to be explicated.
Item difficulty is defined by the proportion of people who answer that
item correctly. Person ability is defined by the proportion of items that
a person answers correctly. Yet I stated that IRT was relatively unrelated
to such things as norming groups, and that the accuracy of estimate didn't
depend upon who else took the test. How does this follow? The critical
concept here is one of the difference between ability and difficulty being
the variable of interest. Suppose we gather a group of individuals to take
the hurdling test. We have no idea of the difficulty of the items nor the
ability of the people. Quickly, we find that some hurdles are easier to
jump than others, and some people are more skilled jumpers. Through the
intervention of the IRT model we obtain ability estimates for the people
and difficulties for the items; of course, they are not correct with
respect to origin, but they are correct relative to one another. If we
now give the same items to another group of subjects we can calculate
their ability on the same scale as the first group. Or we can
have the same group jump different hurdles and calibrate those hurdles on
the same scale as the original ones. Note that we can choose a subset of the
original hurdles to measure some new group of people and do it on the same
scale. Then, if we suspect that a particular group of new examinees may
have greater ability (for hurdling they may be taller, or older) we might
give them fewer short hurdles and more high ones (and so increase the
accuracy of the test in the area of their anticipated ability). This aspect
of IRT is called "Item Independent Person Measurement": being able to
measure individuals on the same scale in a way that is independent of the
precise set of items chosen. Thus, we can choose items in such a way so as
to reduce error of measurement. This is important in computerized adaptive
testing (more later).
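
A rough sketch of item-independent measurement follows. It assumes a logistic response model with a hypothetical scale of 5 cm and hurdle difficulties that are already calibrated, then estimates one hurdler's ability by maximum likelihood from two different sets of hurdles. None of these modelling choices comes from the talk.

```python
import math


def log_likelihood(ability, heights, pattern, scale=5.0):
    """Log-likelihood of a response pattern under an assumed logistic model."""
    total = 0.0
    for height, response in zip(heights, pattern):
        z = (ability - height) / scale
        if response == "1":
            total += -math.log1p(math.exp(-z))  # log P(clear)
        else:
            total += -math.log1p(math.exp(z))   # log P(knock over)
    return total


def estimate(heights, pattern, scale=5.0):
    """Grid-search maximum-likelihood ability estimate, in 0.5 cm steps."""
    grid = [x / 2 for x in range(0, 301)]  # 0.0, 0.5, ..., 150.0 cm
    return max(grid, key=lambda a: log_likelihood(a, heights, pattern, scale))


# The same hurdler (clears up to 55 cm, misses 60 cm and above) on two tests.
full_test = list(range(10, 120, 10))        # 11 hurdles, 10 cm apart
short_test = [40, 45, 50, 55, 60, 65, 70]   # 7 hurdles near the ability level

a_full = estimate(full_test, "11111000000")
a_short = estimate(short_test, "1111000")
print(a_full, a_short)
```

The two estimates come from different hurdles and different numbers of hurdles, yet both land on the same centimeter scale and within a few centimeters of each other; the shorter test simply concentrates its hurdles where this person's ability lies.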

In addition, because the items are calibrated by the difference between
their difficulty and the ability of the norming group we can estimate item
difficulty from any group, if we are sensible about matching items with
people so that they are reasonably suitable.* IRT can give us protection
from our ignorance, but not from stupidity. The characteristic of IRT that
allows us to calibrate items on any group of individuals is called
"sample-free item calibration".

ASIDE - Actually if we do make a mistake and use the wrong calibration
sample, the model will tell us so. Consider what the ability estimate would be
for a person whose hurdling vector is 11111111111. We know that he can clear
110 cm, and we presume that he will miss one of infinite height, so that we
can assign an ability estimate between those two extremes with a huge error.
Such a huge error of estimate tells us that we don't have enough hurdles in the
appropriate range for this person [illegible] item calibration. Of course,
we needn't have been so extreme as to choose [an infinitely high hurdle;
we could instead take a hurdle] of height 300 cm as one that would not have
been scaled, and consequently calculate a more realistic estimate of ability
and error for the 'perfect hurdler'. The insertion of plausible bounds on
ability [illegible] estimation; such so-called "Bayesian" methods appear to
be a fruitful path for future methodology.

* As long as we acquire usable information from them, i.e. if the hurdles
are so much beyond the ability of the norming group that they are almost
never scaled we cannot get an estimate of their difficulty other than that
they are too difficult for this group; similarly, if they are almost never
missed we cannot estimate their difficulty.

Future Test Theory - The shortcoming of the estimation of difficulty
within the context of IRT is that it is operationally tied to the people.
If, somehow, this could be separated from the people taking the test and
assessed independently it would allow us to make much more powerful
conclusions. Let us consider the hurdling test again. Suppose we kept track
of how often each hurdle was successfully jumped, and then we began to make
careful physical measurements of the hurdles themselves: their height and
color, the distance between them, where they occur in the test, etc. Suppose
we then tried to correlate these physical properties with the observed
difficulties. It might be that we would find that we could predict the
difficulty of a hurdle from these physical measurements. We could then
produce a new hurdle, and before anyone had actually tried it be able to
predict those individuals who would or would not be successful in jumping
it. Obviously, in this example height is the crucial variable, and we have
faith that if we have someone whose hurdling ability is estimated to be
60 cm ± 5 and we present this person with a hurdle of 47.3 cm we can be
pretty well assured that his chances of successfully negotiating it are
greater than someone whose ability was measured to be 50 cm. How much
greater, and precisely what is each person's probability of clearing this
untried hurdle, requires the actual use of the IRT model. The use of expert
judges might also enable qualitative judgements to be made; it is the precise
statements of the likelihood of success and error bounds around these
statements that are the strength of the mathematical model.
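
Predicting difficulty from a physical measurement can be sketched as a simple least-squares line. The calibrated difficulties below are made-up numbers for illustration, and using height alone as the predictor is an assumption in keeping with the remark that height is the crucial variable.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical data: measured heights and the difficulties that an IRT
# calibration might have assigned to the same five hurdles.
heights = [10.0, 30.0, 50.0, 70.0, 90.0]
difficulties = [11.0, 29.0, 52.0, 69.0, 91.0]

slope, intercept = fit_line(heights, difficulties)
predicted = slope * 47.3 + intercept  # a hurdle nobody has tried yet
print(round(slope, 3), round(intercept, 3), round(predicted, 1))
```

Once such a relationship is in hand, a new 47.3 cm hurdle gets a difficulty before anyone has jumped it, and the IRT model can then turn that difficulty into a probability of success for a person of any measured ability.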

SOME EXAMPLES

Traditional Test Theory - Most tests are still scored using
traditional true score theory, among them the SAT, the LSAT, the GRE and
virtually all of ETS's tests. There are many reasons for this. Three
factors that come to mind are: inertia (theories don't die, just the
people who believe in them), a desire to maintain comparability with
past performance, and some technical problems associated with the use of
IRT on large scale tests.

IRT - TOEFL (Test of English as a Foreign Language) is equated using
IRT; a qualifying examination given to prospective physicians by the National
Board of Medical Examiners uses IRT for both scoring and equating; many
small scale tests (Wainer, 1980; Bock & Fitzgerald, 1972) are examples. At
ETS a variety of careful studies are underway that were designed to explore
the efficacy of changing to this model on some of the large testing programs
(the GRE and SAT are currently being scrutinized as to their suitability for
the application of IRT) and a new International Aptitude Test (sort of an SAT
given in, other countries) is being considered for IRT use, for calibrating 

• - . ' • ' ■ • • • ' " ' " 1 ■ x r >' ' -I " ' • 

and equating,. \ : v \ . .:: J/ ' '/ ; ' \:\ 
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• - T^st^Theoi^^df ; the Future I have •brought with me one teat that XV 
consider _fc^^ retype of a test of the future* it is the ; 

Degrees of Reading Power (thfi DRP) test currently being used In New York ** ' 
SHata to measure reading eompr ehens ion - It .has a"variety- of Characteristics 
that make it vefy >s&ecial. " A sample f item*' is shown below* It uses a 



•* 1 ' 4 . .\r> - Insert a Sample DRP Passage -Here 



modified f ClozeJ procedure for testing, and it has been f oundj^ttiat:^ the 
difficulty of the questions is almost' perfectly predicted by the readability 
of the passage* The readability is obtained through a weighted combination, 
of several physical characteristics of the prosfi (mean sentence length 9 mean- " 
■■ word length , and the mean frequency of occurrence of the words used in t . 

ordinary English - prose) . Thus, we can measure the 'height of the hurdle' . "' ■ 
» ; in that we can score any piece of exp6^ktory prose for readability with , .. ' - 
"a computer program, and then, predict rather accurately how "well: someone whose 
ability has been assessed with, tffie DRP can read it. Further, it means fchat we .. 
can criterion-reference the -aiility estimates by showing the height of the 
hurdle that a person with a particular ability ca:n successfully scale ~ ■ . : 
i.e. 'your .child can read The Daily News with 90% 3 comprehension, the Times 
with 70% compr ehens ton, and, the New York Review, of "Books with 50% comprehension/'' 
Therefore, the teacher with the aid of the DRP can assign suitable reading
materials to children, and the parent can more readily understand
the progress that his/her child has been making.

Sample DRP Passage

    Dogs help blind people. Many blind people
    stay home. They will not go outside alone.
    They can not see. So they are afraid. They
    think they might fall. Or get hurt. Or get
    lost.

    Such fears are not foolish. There really
    are many ___(1)___. Blind people often
    need help. But they may not ask people for it.
    They may get a dog. It is a seeing eye dog.
    It sees for them. The ___(2)___ helps a lot. It is a
    guard. It is a friend. It is a leader. Man
    and dog go out together. They come to a
    corner. The dog stops. He looks. He listens.
    He thinks. He crosses when it is safe. The
    dog sees if anything is the matter. There may
    be a fence. Or a hole. Or water. He stops.
    Then he shows the safe way. Then they go on.

    The dog must obey. But he must also
    know when not to obey. Good ___(3)___ is
    important. The man may say "Go." But a car
    may be coming. Then the dog must not go.

    1  a) jobs       b) masters
       c) dangers    d) expenses
       e) tests

    2  a) animal     b) doctor
       c) exercise   d) sound
       e) color

    3  a) progress   b) health
       c) company    d) food
       e) sense
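
As a rough illustration of what such a computer program might do, the sketch
below scores a piece of prose from the three characteristics named above: mean
sentence length, mean word length, and mean word frequency. It is only a sketch
under stated assumptions. The function names, the weights, the intercept, and
the small word_frequency table are hypothetical placeholders and are not the
values used by the DRP.

```python
import re


def prose_features(text, word_frequency):
    """Return (mean sentence length, mean word length, mean word frequency).

    word_frequency is a hypothetical table mapping a lowercase word to its
    relative frequency in ordinary prose; unknown words count as 0.0.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    mean_sentence_length = len(words) / len(sentences)
    mean_word_length = sum(len(w) for w in words) / len(words)
    mean_word_frequency = sum(word_frequency.get(w, 0.0) for w in words) / len(words)
    return mean_sentence_length, mean_word_length, mean_word_frequency


def readability_score(text, word_frequency, weights=(1.0, 1.0, -1.0), intercept=0.0):
    """Weighted combination of the three features.

    The default weights are placeholders chosen only so that longer sentences
    and longer words raise the score while more frequent words lower it.
    """
    features = prose_features(text, word_frequency)
    return intercept + sum(w * f for w, f in zip(weights, features))


if __name__ == "__main__":
    toy_table = {"the": 1.0, "he": 0.5}  # hypothetical frequencies
    print(readability_score("The dog stops. He looks.", toy_table))
```

With these placeholder weights a higher score stands for a higher hurdle. In
practice the weights would have to be estimated from data on how difficult
examinees actually find passages of known characteristics.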

Such schemes as that of the DRP seem possible with tests of skills like
reading and arithmetic, but less so with tests of knowledge like history and
economics. Nonetheless, it seems that future tests should aim toward
assessing the difficulty of their items independently of the examinees. Such
assessments are usually called 'content validity', but through the use of
IRT models we are able to parameterize this concept and specify the relationship
between the content validity of the tests and the ability of the examinees.
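
One way to picture the relationship between item difficulty and examinee
ability is the one parameter (Rasch) model, in which the chance of success
depends only on the difference between the two. The sketch below assumes that
model and assumes that ability and difficulty are expressed on a common logit
scale. The function names are hypothetical, and nothing here is taken from the
DRP's own scaling.

```python
import math


def p_success(ability, difficulty):
    """Probability of success under an assumed one parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def difficulty_for_target(ability, target):
    """Difficulty at which a person of this ability succeeds with probability target.

    This inverts p_success, so it inherits the same modelling assumption.
    """
    return ability - math.log(target / (1.0 - target))


if __name__ == "__main__":
    ability = 1.0  # hypothetical examinee, in logits
    for target in (0.9, 0.7, 0.5):
        hurdle = difficulty_for_target(ability, target)
        print(target, round(hurdle, 3), round(p_success(ability, hurdle), 3))
```

Read this way, a statement such as 'the Times with 70% comprehension' amounts
to locating the hurdle height at which the assumed curve for that child
passes through 0.7.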
A second area of improvement in future testing is in the determination of
which items will be presented to an examinee. As was pointed out earlier, the
accuracy of assessment of ability is partially dependent upon the fineness of
the difficulty gradations in the vicinity of each examinee's ability. To make
the entire test finely gradated can make the test overlong, tedious, and
introduce extraneous (albeit perhaps interesting) factors into the determination
of success (grit, determination, endurance, etc.). The usual alternative is
to 'peak' the test in the same area as the ability distribution (i.e. have
more items in the middle of the ability range than in the extremes) or, if the
test is to be used for selection, peak the test in the crucial area of
selection (i.e. if a child has to read at a particular level of competency to be
promoted into the next grade, have most items at that level of competency).
A future improvement is what has been called "Computerized Adaptive Testing"
(CAT). In this application a computer presents items roughly in the
middle of the ability range. If an individual gets them right it
presents more difficult items; however, if he gets them wrong easier
items are presented. This minimizes the number of too-easy items
that could bore the examinee and too-difficult items that can frustrate
him/her. It also reduces the likelihood of blind guessing, since the
examinee will only rarely be facing an item entirely beyond his ken. To
see how this sort of scheme would work, suppose we presented a hurdler
with a 50 cm hurdle. If this was cleared the next one would be 75 cm.
If this was missed, one of 62.5 cm would follow. If this was cleared, one
of 68.75 cm would be given; and if this was missed, one of 65.625 cm, and
so on. The distance between what was passed and what was failed was
continually halved until the ability of the individual was estimated with
acceptable accuracy. Note that in the example above the hurdler had faced
only 5 hurdles, and we were able to estimate his ability to within 3 cm.
This would have been the case for anyone whose ability lay in the range 0-100
cm. Note that to have done this with a conventional test we would have
required hurdles every 6 cm and the hurdler would have faced 11 hurdles
before missing. Thus, the length of the test has effectively been halved;
and for greater accuracy the savings would have been greater. A further
advantage is that the examinee who behaves in a regular way gets a very
short test, and can leave. Someone who behaves in an irregular way stays
longer. Thus, the length of the test required to obtain a fixed amount of
accuracy varies with the examinee. If a test is inappropriate for a
particular examinee, this procedure will not converge within a reasonable
amount of time and so cue the tester to the problem.
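
The hurdle example can be traced step by step with a short sketch. It assumes
an idealized hurdler who always clears any hurdle at or below his ability and
always misses any hurdle above it, which real examinees do not do. The function
names and the tolerance_cm stopping rule are hypothetical choices made for
illustration.

```python
def adaptive_hurdle_test(ability_cm, low=0.0, high=100.0, tolerance_cm=3.2):
    """Halve the interval [low, high] until it is no wider than tolerance_cm.

    Returns the list of hurdle heights presented and the final interval
    that brackets the hurdler's ability.
    """
    presented = []
    while high - low > tolerance_cm:
        hurdle = (low + high) / 2.0
        presented.append(hurdle)
        if ability_cm >= hurdle:
            low = hurdle   # cleared: ability is at least this high
        else:
            high = hurdle  # missed: ability is below this height
    return presented, (low, high)


def fixed_hurdle_test(ability_cm, spacing_cm=6.0, high=100.0):
    """Count hurdles faced on a conventional ascending course.

    The count includes the first hurdle that is missed.
    """
    faced = 0
    height = spacing_cm
    while height <= high:
        faced += 1
        if ability_cm < height:
            break
        height += spacing_cm
    return faced


if __name__ == "__main__":
    heights, interval = adaptive_hurdle_test(64.0)
    print(heights)                  # hurdles presented, in order
    print(interval)                 # interval bracketing the ability
    print(fixed_hurdle_test(64.0))  # hurdles faced on the fixed course
```

For a hurdler whose ability is 64 cm the adaptive course presents the five
hurdles described above, while the fixed course with hurdles every 6 cm
requires eleven.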

ENDING

As I left the fifth floor to come down here to present this material,
I paused to watch a very athletic young man run over the hurdles. I
marveled at his grace and elan, as he cleared all but the highest hurdles.
We chatted briefly, while the clerks were tallying up his score. He told
me that he was in the other study as well, and the coaches there had
classified him a THREE. I said that must indicate a very high rating
because he ran the hurdles so well. He replied modestly that the only
reason I thought he was so good was because a THREE was the best I had
seen. But that my judgement would be quite different had I ever met a
FOUR. I replied that most of this talk was a metaphor.

Gulliksen, H. (1950) Theory of Mental Tests. New York: John Wiley & Sons.
    This is the first complete treatment of true score theory. It explains
    the various concepts of mental testing clearly and unambiguously, and
    provides many examples.

Rasch, G. (1960) Probabilistic models for some intelligence and attainment
    tests. Copenhagen: Nielsen and Lydiche (for Danmarks Paedagogiske Institut).
    Republished in 1980 by the University of Chicago Press: Chicago. This is
    a complete statement of the simplest item response theory, the one parameter
    model, often called 'the Rasch Model' after its originator. Besides
    detailing a test theory model he also explains why this model must be the one
    employed on measurement theory grounds.

Thurstone, L.L. and Chave, E.J. (1929) The measurement of attitude. Chicago:
    The University of Chicago Press. In this book Thurstone (the originator of
    virtually all of modern psychometrics) develops much of the methodology
    that will eventually be called Item Response Theory. He does it in the
    context of attitude measurement, but it is all there.

Lord, F.M. and Novick, M.R. (1968) Statistical Theories of Mental Test Scores.
    New York: Addison-Wesley. This book puts the capstone on classical true
    score theory, developing it from first principles, and showing all of its
    useful aspects. It also presents the details of item response theory in
    the four chapters by Birnbaum. Thus, in one sense it can be thought of as
    ending one era and beginning the next.

Lord, F.M. (1980) Applications of item response theory to practical testing
    problems. New York: Lawrence Erlbaum Associates. The next twelve years
    of IRT from the leading expert in the field. This describes the
    developments of IRT since the publication of Lord and Novick, and shows how to
    use IRT to solve practical problems.

Wright (1979) Best Test Design. MESA Press: Chicago.
    A practical handbook on Rasch Analysis, it is to Rasch's book and
    the 1 parameter model what Lord's book is to the multiparameter models.